Bayesian learning offers significant advantages in solving probabilistic graphical models: it enables information exchange between the perception and inference tasks, captures conditional dependencies in high-dimensional data, and provides effective uncertainty modeling. These properties have been extensively studied in Bayesian neural networks (BayesNNs) [14, 124]. More recent developments establishing the efficacy of BayesNNs can be found in [215, 139] and the references therein. Estimating the posterior distribution is a vital part of Bayesian inference and captures the uncertainty of both the data and the parameters. However, an exact analytical solution for the posterior distribution is intractable, as the number of parameters is large and the functional form of a neural network does not lend itself to exact integration [16]. Several approaches have been proposed to approximate the posterior distribution over the weights of BayesNNs, including optimization-based techniques such as variational inference (VI) and sampling-based approaches such as Markov chain Monte Carlo (MCMC). MCMC techniques are typically used to obtain sampling-based estimates of the posterior distribution, but BayesNNs with MCMC have not seen widespread adoption because of their time and storage costs on large datasets [120].
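To make this cost concrete, the toy sketch below (an illustrative assumption, not a method used in this chapter) draws posterior samples for the weights of a logistic-regression "network" with random-walk Metropolis; every proposed sample requires evaluating the likelihood over the entire dataset, which is exactly what becomes prohibitive for large-scale BayesNNs.

    import numpy as np

    def log_posterior(w, X, y, prior_var=1.0):
        # Gaussian prior on the weights plus the full-data Bernoulli log-likelihood.
        log_prior = -0.5 * np.sum(w ** 2) / prior_var
        logits = X @ w
        log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))
        return log_prior + log_lik

    def metropolis(X, y, n_samples=1000, step=0.05, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        log_p = log_posterior(w, X, y)
        samples = []
        for _ in range(n_samples):
            w_prop = w + step * rng.standard_normal(w.shape)   # random-walk proposal
            log_p_prop = log_posterior(w_prop, X, y)            # full pass over the data
            if np.log(rng.uniform()) < log_p_prop - log_p:      # accept/reject step
                w, log_p = w_prop, log_p_prop
            samples.append(w.copy())
        return np.array(samples)                                # sampling-based posterior estimate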

In contrast to MCMC, VI tends to converge faster and has been applied to many popular Bayesian models, such as factorial and topic models [15]. The basic idea of VI is to first define a family of variational distributions and then minimize the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior over that family. Many recent works have discussed the application of variational inference to BayesNNs, e.g., [16, 216].
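As a concrete illustration of this idea (a minimal sketch with assumed toy dimensions and prior scale, not the formulation used later in this chapter), the code below fits a factorized Gaussian variational distribution to the weights of a single Bayesian linear layer by maximizing the evidence lower bound (ELBO), which is equivalent to minimizing the KL divergence to the true posterior.

    import torch

    def negative_elbo(mu, rho, x, y, prior_std=1.0, n_mc=4):
        # Factorized Gaussian q(w) = N(mu, sigma^2); softplus keeps sigma positive.
        sigma = torch.nn.functional.softplus(rho)
        # Analytic KL(q || p) between factorized Gaussians, with prior p = N(0, prior_std^2).
        kl = (torch.log(prior_std / sigma)
              + (sigma ** 2 + mu ** 2) / (2 * prior_std ** 2) - 0.5).sum()
        # Monte-Carlo estimate of the expected log-likelihood via the reparameterization trick.
        log_lik = 0.0
        for _ in range(n_mc):
            w = mu + sigma * torch.randn_like(mu)
            logits = x @ w
            log_lik = log_lik - torch.nn.functional.binary_cross_entropy_with_logits(
                logits, y, reduction="sum")
        return kl - log_lik / n_mc            # minimizing this maximizes the ELBO

    # Toy data and variational parameters for a 32-dimensional weight vector.
    x, y = torch.randn(128, 32), torch.randint(0, 2, (128,)).float()
    mu = torch.zeros(32, requires_grad=True)
    rho = torch.full((32,), -3.0, requires_grad=True)
    opt = torch.optim.Adam([mu, rho], lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        negative_elbo(mu, rho, x, y).backward()
        opt.step()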

Despite the progress made in 1-bit quantization and network pruning, little work has combined the two in a unified framework so that they reinforce each other. Yet it is necessary to introduce pruning techniques into 1-bit CNNs: not all filters and kernels are equally essential or worth quantizing in the same way, as validated subsequently in our experiments. One potential solution is to prune the network first and then perform 1-bit quantization on the remaining network to obtain a more compressed model. However, such a solution does not consider the difference between the binarized and full-precision parameters during pruning. Intuitively, 1-bit CNNs should be easy to prune, since CNNs remain highly redundant both before and after binarization [150]. Thus, a promising alternative is to conduct pruning over BNNs. However, designing a unified framework that first obtains a 1-bit network and then prunes it remains an open problem. In particular, because the representation ability of a 1-bit network deteriorates, back-propagation becomes highly sensitive to parameter updates, causing existing optimization schemes [77] to fail.

To address this problem, we use Bayesian learning, a well-established global optimization scheme [174, 16], to prune 1-bit CNNs. First, Bayesian learning binarizes the full-precision kernels to two quantization values (centers) to obtain 1-bit CNNs. The quantization error is minimized when the full-precision kernels follow a Gaussian mixture model, with each Gaussian centered on its corresponding quantization value. Given the two centers of a 1-bit CNN, the mixture therefore consists of two Gaussians that model the full-precision kernels.
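The sketch below illustrates this modeling assumption only (it is not the Bayesian kernel loss derived later): the full-precision kernels of a layer are scored under an equal-weight two-component Gaussian mixture whose modes sit at the two quantization centers -alpha and +alpha, where alpha and sigma are hypothetical per-layer parameters introduced here for illustration.

    import torch

    def mixture_nll(w, alpha, sigma=1.0):
        # Negative log-likelihood of full-precision kernels w under an equal-weight
        # two-mode Gaussian mixture centered at the binarized values -alpha and +alpha.
        centers = torch.stack([-alpha, alpha])
        diffs = w.reshape(-1, 1) - centers.reshape(1, -1)
        log_norm = torch.log(sigma * torch.sqrt(torch.tensor(2.0 * torch.pi)))
        log_comp = -0.5 * (diffs / sigma) ** 2 - log_norm - torch.log(torch.tensor(2.0))
        return -torch.logsumexp(log_comp, dim=1).sum()

    # Toy usage: sixteen 3x3x3 kernels and a learnable quantization center alpha.
    w = torch.randn(16, 3, 3, 3, requires_grad=True)
    alpha = torch.tensor(0.8, requires_grad=True)
    mixture_nll(w, alpha).backward()   # gradients pull w toward +/-alpha and adapt alpha

Minimizing such a term pushes the kernels toward the two centers, so that subsequent binarization introduces little reconstruction error.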

Subsequently, the Bayesian learning framework establishes a new pruning operation to prune 1-bit CNNs. In particular, we divide the filters into two groups and assume that those in one group follow the same Gaussian distribution. Their average is then used to replace the weights of the filters in that group.
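A minimal sketch of this replacement step is given below; the rule used to select the group here (distance to the layer mean) is a simple illustrative stand-in, whereas the chapter derives the grouping by maximizing a posterior probability.

    import torch

    def average_filter_group(weight, group_mask):
        # Replace every filter selected by group_mask (a boolean vector over output
        # channels) with the group average, so the whole group shares one filter.
        pruned = weight.clone()
        group_mean = weight[group_mask].mean(dim=0)
        pruned[group_mask] = group_mean
        return pruned

    # Toy usage: 16 filters of shape 3x3x3; group the filters closest to the layer
    # mean (an assumed criterion for illustration only).
    w = torch.randn(16, 3, 3, 3)
    dist = (w - w.mean(dim=0, keepdim=True)).flatten(1).norm(dim=1)
    group_mask = dist < dist.median()
    w_pruned = average_filter_group(w, group_mask)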

Figure 3.20 illustrates the general framework, in which three innovative elements are introduced into the learning procedure of 1-bit CNNs with compression: (1) minimizing the reconstruction error of the parameters before and after quantization, (2) modeling the parameter distribution as a Gaussian mixture with two modes centered on the binarized values, and (3) pruning the quantized network by maximizing a posterior probability. Further analysis leads to our three new losses and the corresponding learning algorithms, referred to as the Bayesian kernel loss, the Bayesian feature loss, and the Bayesian pruning loss. These three losses can be jointly applied with the conventional cross-entropy loss within the same back-propagation pipeline. The advantages of Bayesian